Bayesian Grammar Induction for Language Modeling 
Abstract



We describe a corpus-based induction algorithm
for probabilistic context-free grammars.
The algorithm employs a greedy
heuristic search within a Bayesian frame- 
work, and a post-pass using the Inside- 
Outside algorithm. We compare the per- 
formance of our algorithm to n-gram mod- 
els and the Inside-Outside algorithm in 
three language modeling tasks. In two of 
the tasks, the training data is generated by 
a probabilistic context-free grammar and in 
both tasks our algorithm outperforms the 
other techniques. The third task involves 
naturally-occurring data, and in this task 
our algorithm does not perform as well as 
n-gram models but vastly outperforms the 
Inside-Outside algorithm. 

1 Introduction 

In applications such as speech recognition, handwriting
recognition, and spelling correction, performance
is limited by the quality of the language model utilized
(Bahl et al., 1978; Baker, 1975; Kernighan et al., 1990;
Srihari and Baltus, 1992). However, static
language modeling performance has remained basically
unchanged since the advent of n-gram language
models forty years ago (Shannon, 1951). Yet,
n-gram language models can only capture depen- 
dencies within an n-word window, where currently 
the largest practical n for natural language is three, 
and many dependencies in natural language occur 
beyond a three-word window. In addition, n-gram 
models are extremely large, thus making them diffi- 
cult to implement efficiently in memory-constrained 
applications. 

An appealing alternative is grammar-based language
modeling. Language models expressed as
probabilistic grammars tend to be more compact
than n-gram language models, and can model
long-distance dependencies (Lari and
Young, 1990; Resnik, 1992; Schabes, 1992). However,
to date there has been little success in constructing
grammar-based language models competitive
with n-gram models in problems of any magnitude.

In this paper, we describe a corpus-based induc- 
tion algorithm for probabilistic context-free gram- 
mars that outperforms n-gram models and the 
Inside-Outside algorithm (Baker, 1979) in medium- 
sized domains. This result marks the first time 
a grammar-based language model has surpassed n- 
gram modeling in a task of at least moderate size. 
The algorithm employs a greedy heuristic search 
within a Bayesian framework, and a post-pass us- 
ing the Inside-Outside algorithm. 

2 Grammar Induction as Search 

Grammar induction can be framed as a search problem,
and has been framed as such almost without exception
in past research (Angluin and Smith, 1983).
The search space is taken to be some class of gram- 
mars; for example, in our work we search within the 
space of probabilistic context-free grammars. The 
objective function is taken to be some measure de- 
pendent on the training data; one generally wants to 
find a grammar that in some sense accurately models 
the training data. 

Most work in language modeling, including n- 
gram models and the Inside-Outside algorithm, falls 
under the maximum-likelihood paradigm, where one 
takes the objective function to be the likelihood of 
the training data given the grammar. However, the 
optimal grammar under this objective function is 
one which generates only strings in the training data 
and no other strings. Such grammars are poor language
models, as they overfit the training data and
do not model the language at large. In n-gram models
and the Inside-Outside algorithm, this issue is
evaded by bounding the size and form of the grammars
considered, so that the "optimal" grammar
cannot be expressed. However, in our work we do
not wish to limit the size of the grammars considered.

Table 1: Initial hypothesis grammar

The basic shortcoming of the maximum-likelihood 
objective function is that it does not encompass the 
compelling intuition behind Occam's Razor, that 
simpler (or smaller) grammars are preferable over 
complex (or larger) grammars. A factor in the ob- 
jective function that favors smaller grammars over 
larger ones can prevent the objective function from preferring
grammars that overfit the training data.
Solomonoff (1964) presents a Bayesian grammar induction
framework that includes such a factor in a
motivated manner.

The goal of grammar induction is taken to be finding
the grammar with the largest a posteriori probability
given the training data, that is, finding the
grammar $G'$ where

$$G' = \arg\max_G \, p(G \mid O)$$

and where we denote the training data as $O$, for observations.
As it is unclear how to estimate $p(G \mid O)$
directly, we apply Bayes' Rule and get

$$G' = \arg\max_G \frac{p(O \mid G)\, p(G)}{p(O)} = \arg\max_G \, p(O \mid G)\, p(G)$$

Hence, we can frame the search for $G'$ as a search
with the objective function $p(O \mid G)\, p(G)$, the likelihood
of the training data multiplied by the prior
probability of the grammar.

We satisfy the goal of favoring smaller grammars
by choosing a prior that assigns higher probabilities
to such grammars. In particular, Solomonoff proposes
the use of the universal a priori probability
(Solomonoff, 1960), which is closely related to the
minimum description length principle later proposed
by Rissanen (1978). In the case of grammatical language
modeling, this corresponds to taking

$$p(G) = 2^{-l(G)}$$

where $l(G)$ is the length of the description of the
grammar in bits. The universal a priori probability
has many elegant properties, the most salient
of which is that it dominates all other enumerable
probability distributions multiplicatively.[1]
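Because the prior is a power of two, the objective is convenient to handle in the log domain, where it becomes the log-likelihood minus the description length in bits. The following is a minimal sketch of that bookkeeping; the function name and all numbers are hypothetical and serve only to illustrate how a long grammar that memorizes the data can lose to a compact one.

```python
def log2_objective(log2_likelihood: float, description_bits: float) -> float:
    """log2 of p(O|G) * p(G), using the prior p(G) = 2 ** (-l(G))."""
    return log2_likelihood - description_bits


# Hypothetical numbers, for illustration only (not taken from any experiment).
# A grammar that memorizes the training data: high likelihood, long description.
memorizing = log2_objective(log2_likelihood=-40.0, description_bits=900.0)
# A compact grammar: lower likelihood, much shorter description.
compact = log2_objective(log2_likelihood=-120.0, description_bits=150.0)

# The compact grammar wins under the Bayesian objective.
preferred = "compact" if compact > memorizing else "memorizing"
```

Under pure maximum likelihood the memorizing grammar would be preferred, since only the first term would be compared.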

3 Search Algorithm 

As described above, we take grammar induction to 
be the search for the grammar G' that optimizes the 
objective function $p(O \mid G)\, p(G)$. While this framework
does not restrict us to a particular grammar
formalism, in our work we consider only probabilis- 
tic context-free grammars. 

We assume a simple greedy search strategy. We 
maintain a single hypothesis grammar which is ini- 
tialized to a small, trivial grammar. We then try to 
find a modification to the hypothesis grammar, such 
as the addition of a grammar rule, that results in a 
grammar with a higher score on the objective func- 
tion. When we find a superior grammar, we make 
this the new hypothesis grammar. We repeat this 
process until we can no longer find a modification 
that improves the current hypothesis grammar. 
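The greedy strategy just described can be sketched generically. This is an illustrative skeleton rather than our implementation: the names `greedy_search`, `propose` and `score` are assumptions, and a toy integer problem stands in for the space of grammars.

```python
from typing import Callable, Iterable, TypeVar

H = TypeVar("H")


def greedy_search(
    initial: H,
    propose: Callable[[H], Iterable[H]],
    score: Callable[[H], float],
) -> H:
    """Hill-climb from `initial`, accepting the first improving candidate."""
    current, current_score = initial, score(initial)
    improved = True
    while improved:
        improved = False
        for candidate in propose(current):
            candidate_score = score(candidate)
            if candidate_score > current_score:
                # A superior hypothesis becomes the new current hypothesis.
                current, current_score = candidate, candidate_score
                improved = True
                break  # restart proposals from the new hypothesis
    return current


# Toy usage: integers stand in for grammars, moves are +/- 1,
# and the score peaks at 3.
best = greedy_search(0, lambda x: (x - 1, x + 1), lambda x: -((x - 3) ** 2))
```

The loop stops exactly when no proposed modification improves the score, which is the termination condition stated above.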

For our initial grammar, we choose a grammar 
that can generate any string, to assure that the 
grammar can cover the training data. The initial 
grammar is listed in Table 1. The sentential symbol
S expands to a sequence of X's, where X expands 
to every other nonterminal symbol in the grammar. 
Initially, the set of nonterminal symbols consists of 
a different nonterminal symbol expanding to each 
terminal symbol. 
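As a rough illustration, the rule set of such an initial grammar can be generated from a vocabulary as follows. The representation of rules as (left-hand side, right-hand side) tuples and the naming scheme `A_word` are assumptions made for this sketch, and rule probabilities are omitted.

```python
def initial_grammar(vocabulary):
    """Return (lhs, rhs) rules for a trivial grammar that covers any string."""
    # S expands to a sequence of X's.
    rules = [("S", ("S", "X")), ("S", ("X",))]
    for word in sorted(set(vocabulary)):
        preterminal = f"A_{word}"
        rules.append(("X", (preterminal,)))   # X expands to every other nonterminal
        rules.append((preterminal, (word,)))  # one nonterminal per terminal symbol
    return rules


rules = initial_grammar(["Bob", "talks", "slowly", "Mary"])
```

For a vocabulary of four words this yields ten rules: two for S, and one X rule plus one lexical rule per word.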

Notice that this grammar models a sentence as
a sequence of independently generated nonterminal
symbols. We maintain this property throughout the
search process, that is, for every symbol $A'$ that we
add to the grammar, we also add a rule $X \to A'$.
This assures that the sentential symbol can expand
to every symbol; otherwise, adding a symbol will not
affect the probabilities that the grammar assigns to
strings.

[1] A very thorough discussion of the universal a priori
probability is given by Li and Vitanyi (1992).

Figure 1: Initial Viterbi Parse

Figure 2: Predicted Viterbi Parse

We use the term move set to describe the set of 
modifications we consider to the current hypothesis 
grammar to hopefully produce a superior grammar. 
Our move set includes the following moves: 

Move 1: Create a rule of the form $A \to BC$

Move 2: Create a rule of the form $A \to B \mid C$

For any context-free grammar, it is possible to express
a weakly equivalent grammar using only rules
of these forms. As mentioned before, with each new
symbol $A$ we also create a rule $X \to A$.

3.1 Evaluating the Objective Function 

Consider the task of calculating the objective function
$p(O \mid G)\, p(G)$ for some grammar $G$. Calculating
$p(G) = 2^{-l(G)}$ is inexpensive[2]; however, calculating
$p(O \mid G)$ requires a parsing of the entire training data.
We cannot afford to parse the training data for each
grammar considered; indeed, to ever be practical for
data sets of millions of words, it seems likely that we
can only afford to parse the data once.

[2] Due to space limitations, we do not specify our
method for encoding grammars, i.e., how we calculate
$l(G)$ for a given $G$. However, this will be described in
the author's forthcoming Ph.D. dissertation.

To achieve this goal, we employ several approx- 
imations. First, notice that we do not ever need 
to calculate the actual value of the objective func- 
tion; we need only to be able to distinguish when 
a move applied to the current hypothesis grammar 
produces a grammar that has a higher score on the 
objective function, that is, we need only to be able 
to calculate the difference in the objective function 
resulting from a move. This can be done efficiently 
if we can quickly approximate how the probability 
of the training data changes when a move is applied. 

To make this possible, we approximate the probability
of the training data $p(O \mid G)$ by the probability
of the single most probable parse, or Viterbi parse, 
of the training data. Furthermore, instead of recal- 
culating the Viterbi parse of the training data from 
scratch when a move is applied, we use heuristics to 
predict how a move will change the Viterbi parse. 
For example, consider the case where the training 
data consists of the two sentences 

O = {Bob talks slowly, Mary talks slowly} 

In Figure 1, we display the Viterbi parse of this data
under the initial hypothesis grammar used in our 
algorithm. 



Now, let us consider the move of adding the rule

$$B \to A_{talks}\, A_{slowly}$$

to the initial grammar (as well as the concomitant
rule $X \to B$). A reasonable heuristic for predicting
how the Viterbi parse will change is to replace
adjacent $X$'s that expand to $A_{talks}$ and $A_{slowly}$
respectively with a single $X$ that expands to $B$, as
displayed in Figure 2. This is the actual heuristic
we use for moves of the form $A \to BC$, and we have
analogous heuristics for each move in our move set.
By predicting the differences in the Viterbi parse resulting
from a move, we can quickly estimate the
change in the probability of the training data.
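The replacement heuristic for concatenation moves can be illustrated on the two-sentence example. In this sketch a parse is reduced to the sequence of symbols that the $X$'s expand to, which is an assumed simplification, and the function name `predict_parse` is hypothetical.

```python
def predict_parse(symbols, left, right, new_symbol):
    """Replace each adjacent (left, right) pair with new_symbol, left to right."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(new_symbol)  # two X's collapse into a single X -> B
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


# Initial parses of "Bob talks slowly" and "Mary talks slowly".
corpus = [["A_Bob", "A_talks", "A_slowly"], ["A_Mary", "A_talks", "A_slowly"]]
predicted = [predict_parse(s, "A_talks", "A_slowly", "B") for s in corpus]
```

Each sentence shrinks from three top-level symbols to two, so the predicted change in the probability of the data can be computed from the affected rule applications alone, without reparsing.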

Notice that our predicted Viterbi parse can stray 
a great deal from the actual Viterbi parse, as errors 
can accumulate as move after move is applied. To 
minimize these effects, we process the training data 
incrementally. Using our initial hypothesis gram- 
mar, we parse the first sentence of the training data 
and search for the optimal grammar over just that 
one sentence using the described search framework. 
We use the resulting grammar to parse the second 
sentence, and then search for the optimal grammar 
over the first two sentences using the last grammar 
as the starting point. We repeat this process, pars- 
ing the next sentence using the best grammar found 
on the previous sentences and then searching for the 
best grammar taking into account this new sentence, 
until the entire training corpus is covered. 

Delaying the parsing of a sentence until all of the 
previous sentences are processed should yield more 
accurate Viterbi parses during the search process 
than if we simply parse the whole corpus with the 
initial hypothesis grammar. In addition, we still 
achieve the goal of parsing each sentence but once. 

3.2 Parameter Training 

In this section, we describe how the parameters of 
our grammar, the probabilities associated with each 
grammar rule, are set. Ideally, in evaluating the ob- 
jective function for a particular grammar we should 
use its optimal parameter settings given the training 
data, as this is the full score that the given grammar 
can achieve. However, searching for optimal param- 
eter values is extremely expensive computationally. 
Instead, we grossly approximate the optimal values 
by deterministically setting parameters based on the 
Viterbi parse of the training data parsed so far. We 
rely on the post-pass, described later, to refine pa- 
rameter values. 

Referring to the rules in Table 1, the parameter $\epsilon$ is
set to an arbitrary small constant. The values of the
parameters $p(A)$ are set to the (smoothed) frequency
of the $X \to A$ reduction in the Viterbi parse of the
data seen so far. The remaining symbols are set to
expand uniformly among their possible expansions.
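A minimal sketch of setting the parameters $p(A)$ from reduction counts is given below. The smoothing scheme is not specified here, so the additive pseudo-count is purely an assumption for illustration, as are the function name `estimate_p` and the flattened parse representation.

```python
from collections import Counter


def estimate_p(parses, symbols, pseudo_count=0.5):
    """Smoothed relative frequency of X -> A reductions in the parses so far.

    Additive smoothing with `pseudo_count` is an assumed stand-in for the
    unspecified smoothing method.
    """
    counts = Counter(a for parse in parses for a in parse)
    total = sum(counts[a] for a in symbols) + pseudo_count * len(symbols)
    return {a: (counts[a] + pseudo_count) / total for a in symbols}


# Top-level symbols of the predicted parses after adding B -> A_talks A_slowly.
parses = [["A_Bob", "B"], ["A_Mary", "B"]]
p = estimate_p(parses, ["A_Bob", "A_Mary", "A_talks", "A_slowly", "B"])
```

Symbols that no longer occur at the top level, such as `A_talks`, keep a small nonzero probability, so the grammar can still generate any string.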

3.3 Constraining Moves 

Consider the move of creating a rule of the form 
$A \to BC$. This corresponds to $k^3$ different specific
rules that might be created, where k is the current 
number of symbols in the grammar. As it is too 
computationally expensive to consider each of these 
rules at every point in the search, we use heuristics 
to constrain which moves are appraised. 

For the left-hand side of a rule, we always cre- 
ate a new symbol. This heuristic selects the opti- 
mal choice the vast majority of the time; however, 
under this constraint the moves described earlier in 
this section cannot yield arbitrary context-free lan- 
guages. To partially address this, we add the move 

Move 3: Create a rule of the form $A \to AB \mid B$

With this iteration move, we can construct gram- 
mars that generate arbitrary regular languages. As 
yet, we have not implemented moves that enable 
the construction of arbitrary context-free grammars; 
this belongs to future work. 

To constrain the symbols we consider on the right-hand
side of a new rule, we use what we call triggers.[3]
A trigger is a phenomenon in the Viterbi
parse of a sentence that is indicative that a particular
move might lead to a better grammar. For example,
in Figure 1 the fact that the symbols $A_{talks}$ and
$A_{slowly}$ occur adjacently is indicative that it could
be profitable to create a rule $B \to A_{talks}\, A_{slowly}$. We
have developed a set of triggers for each move in our
move set, and only consider a specific move if it is
triggered in the sentence currently being parsed in
the incremental processing.
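For the concatenation move, the adjacency trigger from the example can be sketched as collecting adjacent symbol pairs from the parse of the current sentence. This covers only the one trigger described above; the function name and the flattened parse representation are assumptions for illustration.

```python
def concatenation_triggers(parse):
    """Adjacent symbol pairs in one sentence's top-level parse, in first-seen order."""
    seen, triggers = set(), []
    for pair in zip(parse, parse[1:]):
        if pair not in seen:
            seen.add(pair)
            triggers.append(pair)
    return triggers


triggers = concatenation_triggers(["A_Bob", "A_talks", "A_slowly"])
```

Only the pairs returned here would be appraised as right-hand sides of a new rule, rather than all possible symbol pairs in the grammar.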

3.4 Post-Pass 

A conspicuous shortcoming in our search framework 
is that the grammars in our search space are fairly 
unexpressive. Firstly, recall that our grammars 
model a sentence as a sequence of independently gen- 
erated symbols; however, in language there is a large 
dependence between adjacent constituents. Further- 
more, the only free parameters in our search are the 
parameters p(A); all other symbols (except S) are 
fixed to expand uniformly. These choices were nec- 
essary to make the search tractable. 

To address this issue, we use an Inside-Outside algorithm
post-pass. Our methodology is derived from
that described by Lari and Young (1990). We create
$n$ new nonterminal symbols $\{X_1, \ldots, X_n\}$, and
create all rules of the form:

$$X_i \to X_j X_k \qquad i, j, k \in \{1, \ldots, n\}$$
$$X_i \to A \qquad i \in \{1, \ldots, n\},\ A \in N_{old} - \{S, X\}$$

$N_{old}$ denotes the set of nonterminal symbols acquired
in the initial grammar induction phase, and
$X_1$ is taken to be the new sentential symbol. These
new rules replace the first three rules listed in Table
1. The parameters of these rules are initialized randomly.
Using this grammar as the starting point,
we run the Inside-Outside algorithm on the training
data until convergence.

[3] This is not to be confused with the use of the term
triggers in dynamic language modeling.

In other words, instead of using the naive $S \to SX \mid X$
rule to attach symbols together in parsing
data, we now use the $X_i$ rules and depend on the
Inside-Outside algorithm to train these randomly 
initialized rules intelligently. This post-pass allows 
us to express dependencies between adjacent sym- 
bols. In addition, it allows us to train parameters 
that were fixed during the initial grammar induc- 
tion phase. 
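The rule set created for the post-pass can be enumerated directly from its definition. The sketch below only generates the rules; random parameter initialization and the Inside-Outside training itself are omitted, and the function name and symbol naming are assumptions.

```python
from itertools import product


def post_pass_rules(n, old_nonterminals):
    """All rules X_i -> X_j X_k and X_i -> A, for A in N_old minus {S, X}."""
    xs = [f"X_{i}" for i in range(1, n + 1)]
    kept = [a for a in old_nonterminals if a not in ("S", "X")]
    binary = [(xi, (xj, xk)) for xi, xj, xk in product(xs, repeat=3)]
    unary = [(xi, (a,)) for xi in xs for a in kept]
    return binary + unary


rules = post_pass_rules(3, ["S", "X", "A_Bob", "A_talks", "B"])
```

With $n = 3$ and three induced nonterminals this gives $3^3 + 3 \cdot 3 = 36$ rules, which shows how quickly the number of parameters grows with $n$.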

4 Previous Work 



As mentioned, this work employs the Bayesian grammar
induction framework described by Solomonoff
(1960; 1964). However, Solomonoff does not specify
a concrete search algorithm and only makes sugges- 
tions as to its nature. 

Similar research includes work by Cook et al. 
(1976) and Stolcke and Omohundro (1994). This 
work also employs a heuristic search within a 
Bayesian framework. However, a different prior 
probability on grammars is used, and the algorithms 
are only efficient enough to be applied to small data 
sets. 

The grammar induction algorithms most successful
in language modeling include the Inside-Outside
algorithm (Lari and Young, 1990; Lari
and Young, 1991; Pereira and Schabes, 1992), a
special case of the Expectation-Maximization algorithm
(Dempster et al., 1977), and work by
McCandless and Glass (1993). In the latter work,
McCandless uses a heuristic search procedure similar
to ours, but a very different search criterion. To
our knowledge, neither algorithm has surpassed the 
performance of n-gram models in a language model- 
ing task of substantial scale. 

5 Results 

To evaluate our algorithm, we compare its performance
to that of n-gram models
and the Inside-Outside algorithm.

For n-gram models, we tried $n = 1, \ldots, 10$ for each
domain. For smoothing a particular n-gram model,
we took a linear combination of all lower order n-gram
models. In particular, we follow standard practice
(Jelinek and Mercer, 1980; Bahl et al., 1983;
Brown et al., 1992) and take the smoothed $i$-gram
probability to be a linear combination of the $i$-gram
frequency in the training data and the smoothed
$(i-1)$-gram probability, that is,

$$p(w_0 \mid W = w_{1-i} \cdots w_{-1}) = \lambda_{i,c(W)} \frac{c(W w_0)}{c(W)} + (1 - \lambda_{i,c(W)})\, p(w_0 \mid w_{2-i} \cdots w_{-1})$$

where $c(W)$ denotes the count of the word sequence
$W$ in the training data. The smoothing parameters
$\lambda_{i,c}$ are trained through the Forward-Backward algorithm
(Baum and Eagon, 1967) on held-out data.
Parameters $\lambda_{i,c}$ are tied together for similar $c$ to prevent
data sparsity.
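The recursive interpolation can be sketched as follows. This is a simplified illustration, not the configuration used in our experiments: a single fixed weight `lam` replaces the trained, count-dependent $\lambda_{i,c}$, and the recursion is assumed to bottom out in a uniform distribution over the vocabulary. All function names are hypothetical.

```python
from collections import Counter


def train_counts(sentences, max_order):
    """Count every word sequence of length 1..max_order, plus the empty context."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words)):
            counts[()] += 1  # total number of words, i.e. c(W) for empty W
            for order in range(1, max_order + 1):
                if i + order <= len(words):
                    counts[tuple(words[i:i + order])] += 1
    return counts


def smoothed_prob(word, context, counts, vocab_size, lam=0.7):
    """Interpolated probability of `word` given the tuple `context`."""
    if context:
        # Lower-order estimate: drop the most distant context word.
        lower = smoothed_prob(word, context[1:], counts, vocab_size, lam)
    else:
        lower = 1.0 / vocab_size  # assumed base case: uniform distribution
    c_context = counts[context]
    if c_context == 0:
        return lower  # unseen context: rely entirely on the lower order
    return lam * counts[context + (word,)] / c_context + (1 - lam) * lower


counts = train_counts([["a", "b", "a", "b"], ["a", "c"]], max_order=2)
p_b_given_a = smoothed_prob("b", ("a",), counts, vocab_size=3)
```

In this toy corpus the word "a" is always followed by another word, so the smoothed bigram probabilities given "a" sum to one over the three-word vocabulary.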

For the Inside-Outside algorithm, we follow the
methodology described by Lari and Young. For a
given $n$, we create a probabilistic context-free grammar
consisting of all Chomsky normal form rules
over the $n$ nonterminal symbols $\{X_1, \ldots, X_n\}$ and the
given terminal symbols, that is, all rules

$$X_i \to X_j X_k \qquad i, j, k \in \{1, \ldots, n\}$$
$$X_i \to a \qquad i \in \{1, \ldots, n\},\ a \in T$$



where T denotes the set of terminal symbols in the 
domain. All parameters are initialized randomly. 
From this starting point, the Inside-Outside algo- 
rithm is run until convergence. 

For smoothing, we combine the expansion distribution
of each symbol with a uniform distribution,
that is, we take the smoothed parameter $p_s(A \to \alpha)$
to be

$$p_s(A \to \alpha) = (1 - \lambda)\, p_u(A \to \alpha) + \lambda \frac{1}{n^3 + n|T|}$$

where $p_u(A \to \alpha)$ denotes the unsmoothed parameter.
The value $n^3 + n|T|$ is the number of different
ways a symbol expands under the Lari and Young
methodology. The parameter $\lambda$ is trained through
the Inside-Outside algorithm on held-out data. This
smoothing is also performed on the Inside-Outside
post-pass of our algorithm. For each domain, we
tried $n = 3, \ldots, 10$.

Table 2: English-like artificial grammar

                  best n   entropy       entr. relative
                           (bits/word)   to n-gram
ideal grammar              2.30          -6.5%
our algorithm              2.37          -3.7%
n-gram model      4        2.46
Inside-Outside    9        2.60          +5.7%

Table 3: Wall Street Journal-like artificial grammar

                  best n   entropy       entr. relative
                           (bits/word)   to n-gram
ideal grammar              4.13          -10.4%
our algorithm     9        4.44          -3.7%
n-gram model      4        4.61
Inside-Outside    9        4.64          +0.7%

Because of the computational demands of our
algorithm, it is currently impractical to apply it
to large vocabulary or large training set problems.
However, we present the results of our algorithm in
three medium-sized domains. In each case, we use
4500 sentences for training, with 500 of these sentences
held out for smoothing. We test on 500 sentences,
and measure performance by the entropy of
the test data.
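Test-set entropy in bits per word can be estimated from the probabilities a model assigns to the test sentences. The sketch below assumes the usual per-word normalization of the negative log probability; details such as whether an end-of-sentence event is counted are not specified here, and the function name is hypothetical.

```python
import math


def entropy_bits_per_word(sentence_probs, sentence_lengths):
    """Average negative log2 probability per word over the test sentences."""
    total_bits = -sum(math.log2(p) for p in sentence_probs)
    return total_bits / sum(sentence_lengths)


# Two hypothetical test sentences of 2 and 3 words with model
# probabilities 1/4 and 1/8: (2 + 3) bits over 5 words.
h = entropy_bits_per_word([0.25, 0.125], [2, 3])
```

Lower values indicate a better language model, which is why negative entries in the relative-entropy column of the result tables denote an improvement over the n-gram baseline.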

In the first two domains, we created the train- 
ing and test data artificially so as to have an ideal 
grammar in hand to benchmark results. In particu- 
lar, we used a probabilistic grammar to generate the 
data. In the first domain, we created this grammar 
by hand; the grammar was a small English-like prob- 
abilistic context-free grammar consisting of roughly 
10 nonterminal symbols, 20 terminal symbols, and 
30 rules. In the second domain, we derived the gram- 
mar from manually parsed text. From a million 
words of parsed Wall Street Journal data from the 
Penn treebank, we extracted the 20 most frequently 
occurring symbols, and the 10 most frequently oc- 
curring rules expanding each of these symbols. For 
each symbol that occurs on the right-hand side of 
a rule but which was not one of the most frequent 
20 symbols, we create a rule that expands that sym- 
bol to a unique terminal symbol. After removing 
unreachable rules, this yields a grammar of roughly 
30 nonterminals, 120 terminals, and 160 rules. Pa- 
rameters are set to reflect the frequency of the cor- 
responding rule in the parsed corpus. 

For the third domain, we took English text and 
reduced the size of the vocabulary by mapping each 
word to its part-of-speech tag. We used tagged Wall 
Street Journal text from the Penn treebank, which 
has a tag set size of about fifty. 

In Tables 2-4, we summarize our results. The
ideal grammar denotes the grammar used to generate
the training and test data. For each algorithm,
we list the best performance achieved over all n tried, 
and the best n column states which value realized 
this performance. 

We achieve a moderate but significant improve- 
ment in performance over n-gram models and the 
Inside-Outside algorithm in the first two domains, 
while in the part-of-speech domain we are outper- 
formed by n-gram models but we vastly outperform 
the Inside-Outside algorithm. 

In Table 5, we display a sample of the number
of parameters and execution time (on a Decstation 
5000/33) associated with each algorithm. We choose 
n to yield approximately equivalent performance for 
each algorithm. The first pass row refers to the main 
grammar induction phase of our algorithm, and the 
post-pass row refers to the Inside-Outside post-pass. 

Notice that our algorithm produces a significantly 
more compact model than the n-gram model, while 
running significantly faster than the Inside-Outside 
algorithm even though we use an Inside-Outside 
post-pass. Part of this discrepancy is due to the fact 
that we require a smaller number of new nonterminal 
symbols to achieve equivalent performance, but we 
have also found that our post-pass converges more 
quickly even given the same number of nonterminal 
symbols. 

Table 4: English sentence part-of-speech sequences

                  best n   entropy       entr. relative
                           (bits/word)   to n-gram
our algorithm              3.15          +4.7%
Inside-Outside    7        3.93          +30.6%

Table 5: Parameters and Training Time

WSJ artif.        n    entropy       no.      time
                       (bits/word)   params   (sec)
n-gram            3    4.61          15000    50
IO                9    4.64          2000     30000
first pass                           800      1000
post-pass         5    4.60          4000     5000

6 Discussion

Our algorithm consistently outperformed the Inside-Outside
algorithm in these experiments. While we
partially attribute this difference to using a Bayesian
instead of maximum-likelihood objective function,
we believe that part of this difference results from a
more effective search strategy. In particular, though
both algorithms employ a greedy hill-climbing strategy,
our algorithm gains an advantage by being able
to add new rules to the grammar.

In the Inside-Outside algorithm, the gradient de- 
scent search discovers the "nearest" local minimum 
in the search landscape to the initial grammar. If 
there are $k$ rules in the grammar and thus $k$ parameters,
then the search takes place in a fixed $k$-dimensional
space $\mathbb{R}^k$. In our algorithm, it is possible
to expand the hypothesis grammar, thus increasing
the dimensionality of the parameter space that
is being searched. An apparent local minimum in
the space $\mathbb{R}^k$ may no longer be a local minimum in
the space $\mathbb{R}^{k+1}$; the extra dimension may provide a
pathway for further improvement of the hypothesis 
grammar. Hence, our algorithm should be less prone 
to suboptimal local minima than the Inside-Outside 
algorithm. 

Outperforming n-gram models in the first two do- 
mains demonstrates that our algorithm is able to 
take advantage of the grammatical structure present 
in data. However, the superiority of n-gram models 
in the part-of-speech domain indicates that to be 
competitive in modeling naturally-occurring data, it 
is necessary to model collocational information ac- 
curately. We need to modify our algorithm to more 
aggressively model n-gram information. 

7 Conclusion 

This research represents a step forward in the quest 
for developing grammar-based language models for 
natural language. We induce models that, while being
substantially more compact, outperform n-gram
language models in medium-sized domains. The algorithm
runs essentially in time and space linear in
the size of the training data, so larger domains are 
within our reach. 

However, we feel the largest contribution of this 
work does not lie in the actual algorithm specified,
but rather in its indication of the potential of the in- 
duction framework described by Solomonoff in 1964. 
We have implemented only a subset of the moves 
that we have developed, and inspection of our re- 
sults gives reason to believe that these additional 
moves may significantly improve the performance of 
our algorithm. 

Solomonoff's induction framework is not re- 
stricted to probabilistic context-free grammars. Af- 
ter completing the implementation of our move 
set, we plan to explore the modeling of context- 
sensitive phenomena. This work demonstrates that 
Solomonoff's elegant framework deserves much fur- 
ther consideration. 